Session 4

This week, you're going to expand on what you learned last week. First, you'll review how to do linear regression with statsmodels, and then learn how to identify outliers, leverage points, and influential points. Then we're going to use the bird dataset to do linear regression in multiple variables (multilinear regression). We will make use of the machine-learning library scikit-learn (https://scikit-learn.org) and the statistical visualization library seaborn (https://seaborn.pydata.org). Finally, you'll learn how to use LDA for more complicated classification problems.

keywords: regression: outliers, leverage points, multilinear regression, logistic regression — classifiers: Linear Discriminant Analysis (LDA)

Exercise 1:

In this first exercise, you're going to practice identifying outliers, leverage points, and influential points. We'll use a simple linear regression with one predictor.

Multilinear Regression

Exercise 2:

In this second exercise, you'll be working with the bird dataset from last week (bird_data_vincze_etal_2015.csv) to try to predict migration distances.

Using linear_model.LinearRegression is slightly different from what you've used before. Here's how it works:

  1. Create the predictor object: reg = linear_model.LinearRegression()
  2. Fit the linear model to your data by calling the fit method: reg.fit(X, y). You can assess the quality of the fit with the reg.score(X, y) method (the R² value) or inspect the fitted parameters via the reg.coef_ and reg.intercept_ attributes.
  3. Predict using the linear model with reg.predict(X)
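The three steps above can be sketched as follows (with a small made-up data set in place of the bird data, constructed so that y = 3·x1 − 2·x2 + 5 exactly):

```python
import numpy as np
from sklearn import linear_model

# Toy data: two predictors with an exact linear relationship y = 3*x1 - 2*x2 + 5
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 3.]])
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

reg = linear_model.LinearRegression()   # step 1: create the predictor object
reg.fit(X, y)                           # step 2: fit the model to the data
r2 = reg.score(X, y)                    # R^2 of the fit (1.0 here, since the data is exact)
y_pred = reg.predict(X)                 # step 3: predict with the fitted model
```

Note that X must be two-dimensional (one row per observation, one column per predictor), even when there is only a single predictor.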

Classification with logistic regression and linear discriminant analysis

Logistic regression

In the lecture you've learned about logistic regression, which is the appropriate tool when the dependent variable is dichotomous (binary). We're going to apply this tool to the iris flower dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set) and see if we can classify flowers correctly.

The iris dataset is included in scikit-learn.
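It can be loaded with the load_iris helper from sklearn.datasets:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)        # (150, 4): 150 flowers, 4 features each
print(iris.feature_names)     # names of the four feature columns
print(iris.target[:5])        # class labels, encoded as 0, 1, 2
print(iris.target_names)      # the species those labels stand for
```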

The load_iris function returns a dictionary-like object with several fields. All the data points are contained in 'data'; the four columns correspond to the four features sepal length, sepal width, petal length, and petal width (each in cm), listed in 'feature_names'.

For every data point, the respective class label is stored in 'target'. There are three classes in total: setosa, versicolor, and virginica.

As usual, it's a good idea to have a look at the data before starting with your analysis. Let's do some pairplots with seaborn.

We want to focus on binary classification, and hence we discard all data points belonging to a certain class. This can be done by, for example, selecting all the rows that do not belong to that class. Here, we drop the class 'virginica'.
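Filtering out one class and fitting a logistic regression could look like this (a sketch; the DataFrame and column name 'species' are set up as in the pairplot step above):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = [iris.target_names[t] for t in iris.target]

# Keep only the rows that do NOT belong to 'virginica'
binary = df[df['species'] != 'virginica']

# Binary labels: 1 for versicolor, 0 for setosa
X = binary[iris.feature_names].values
y = (binary['species'] == 'versicolor').astype(int)

clf = LogisticRegression()
clf.fit(X, y)
acc = clf.score(X, y)   # fraction of correctly classified training points
```

Setosa and versicolor are linearly separable in these features, so the training accuracy here is very high; in the exercise you should of course evaluate on held-out data rather than on the training set.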

Exercise 3:

Linear Discriminant Analysis (LDA)

We're going to keep working with the iris dataset from the last exercise. We will use the built-in classifier in scikit-learn.
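A minimal sketch of the scikit-learn classifier, fitted on all three classes at once (evaluated on the training data here only for brevity):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()

lda = LinearDiscriminantAnalysis()
lda.fit(iris.data, iris.target)              # unlike logistic regression above, all 3 classes
pred = lda.predict(iris.data)                # predicted class labels (0, 1, 2)
accuracy = lda.score(iris.data, iris.target) # fraction correctly classified
```

LDA handles the multi-class case directly, which is one reason to prefer it over plain logistic regression when there are more than two classes.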

Exercise 4:

Start off with a scatter plot where the colors indicate the species.
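One way to build such a plot with matplotlib, using the first two features (which two features to plot is your choice; the Agg backend line is only needed when running as a script without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()

fig, ax = plt.subplots()
# One scatter call per species, so each gets its own color and legend entry
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax.scatter(iris.data[mask, 0], iris.data[mask, 1], label=name)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.legend()
```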